Abstract

Thompson Sampling has recently been shown to be optimal in the Bernoulli
Multi-Armed Bandit setting [Kaufmann et al., 2012]. This bandit problem
assumes stationary distributions for the rewards. It is often unrealistic
to model the real world as a stationary distribution. In this paper we
derive and evaluate algorithms using Thompson Sampling for a Switching
Multi-Armed Bandit Problem.

We propose a Thompson Sampling strategy equipped with a Bayesian
change point mechanism to tackle this problem. We develop algorithms
for a variety of cases with a constant switching rate: when all arms
change together at each switch (Global Switching), when switching occurs
independently for each arm (Per-Arm Switching), when the switching rate
is known, and when it must be inferred from data. This leads to a family
of algorithms we collectively term Change-Point Thompson Sampling (CTS).

We show empirical results for the algorithms in 4 artificial environments
and 2 derived from real-world data: news click-through [Yahoo!, 2011] and
foreign exchange data [Dukascopy, 2012], comparing them to some other
bandit algorithms. On the real-world data CTS is the most effective.



1 Introduction 

Thompson Sampling has recently been shown to be optimal in the Bernoulli
Multi-Armed Bandit setting [Kaufmann et al., 2012]. This bandit problem
assumes stationary distributions for the rewards. It is often unrealistic to
model the real world as a stationary distribution, and algorithms such as
Adapt-EvE have been proposed to solve bandit problems in this environment.
In this paper we concern ourselves with a Switching Multi-Armed Bandit
Problem.

We first review Thompson Sampling, before describing the non-stationary
environment we are concerned with. We then propose a method to solve this
problem and review the techniques we employ. We then test our algorithms on
a variety of environments.

1.1 Thompson Sampling



Thompson Sampling is effectively a probability matching algorithm. The desired
strategy is to pull an arm with the probability that the arm is the best arm.
This probability can be written as

$$P(a_i = a^*) = \int_{\Theta} I(a_i = a^* \mid \theta)\, P(\theta \mid D)\, d\theta, \qquad (1)$$

where $\theta$ is a model of our arms, $a^*$ is the optimal arm, $a_i$ is the
$i$-th arm and $D$ is the history of past rewards. $I(x)$ is the indicator
function, which is 1 when $x$ is true, and 0 otherwise. The strategy can thus
be reduced to sampling from the model distribution $P(\theta \mid D)$ and then
picking the arm that is maximal given this model.

This paper considers multi-armed bandits with Bernoulli arms. Each arm $j$
delivers a reward of 1 with probability $\theta_j$ and 0 otherwise. In the
stationary case the parameter is $\theta = (\theta_1, \ldots, \theta_k)$, where
$k$ is the number of arms. The arms are assumed independent, so
$P(\theta \mid D)$ is a product of terms $P(\theta_j \mid D_j)$, where $D_j$
are the past rewards for arm $j$. Further, since the arms are Bernoulli, for
which the Beta distribution is a conjugate prior, we can write
$P(\theta \mid D)$ as a product of Beta distributions:

$$P(\theta \mid D) = \prod_{j=1}^{k} P(\theta_j \mid \alpha_j, \beta_j, D_j), \qquad (2)$$

where $\alpha_j = \#\{\text{reward} = 1\} + \alpha_0$ and
$\beta_j = \#\{\text{reward} = 0\} + \beta_0$.

We can sample from $P(\theta \mid D)$ by sampling from every
$P(\theta_j \mid \alpha_j, \beta_j, D_j)$ and choosing the arm $j$ with the
largest $\theta_j$.
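
The stationary scheme described above can be sketched in a few lines of
Python. This is a minimal illustration rather than a reference implementation:
the function names (`thompson_select`, `thompson_update`, `run_stationary`)
and the uniform prior $\alpha_0 = \beta_0 = 1$ are assumptions made for the
example.

```python
import random


def thompson_select(alpha, beta, rng=random):
    """Sample theta_j ~ Beta(alpha_j, beta_j) per arm; return the argmax arm."""
    samples = [rng.betavariate(a, b) for a, b in zip(alpha, beta)]
    return max(range(len(samples)), key=samples.__getitem__)


def thompson_update(alpha, beta, arm, reward):
    """Conjugate update: a success increments alpha, a failure increments beta."""
    if reward == 1:
        alpha[arm] += 1
    else:
        beta[arm] += 1


def run_stationary(true_theta, steps, seed=0):
    """Play a stationary Bernoulli bandit; return the pull count of each arm."""
    rng = random.Random(seed)
    k = len(true_theta)
    alpha, beta = [1.0] * k, [1.0] * k   # assumed prior alpha_0 = beta_0 = 1
    pulls = [0] * k
    for _ in range(steps):
        arm = thompson_select(alpha, beta, rng)
        reward = 1 if rng.random() < true_theta[arm] else 0
        thompson_update(alpha, beta, arm, reward)
        pulls[arm] += 1
    return pulls
```

Calling `run_stationary([0.2, 0.8], 2000)` plays 2000 rounds on a two-armed
bandit; the pull counts should concentrate on the arm with the higher success
probability as the posteriors sharpen.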



1.2 Model of Dynamic Environment 

In this paper we assume that the environment changes over time. We assume
abrupt switching defined by a hazard function $h(t)$, such that

$$\theta_i(t) = \begin{cases} \theta_i(t-1) & \text{with probability } 1 - h(t), \\ \theta_{\text{new}} \sim U(0,1) & \text{with probability } h(t). \end{cases} \qquad (3)$$

The algorithms presented are designed with 2 such models in mind. The first
model we will refer to as the Global Switching model. This model switches at
a constant rate, and when a change point happens all arms change their
expected rewards. The second model will be referred to as Per-Arm Switching.
In this model change points occur independently for each arm, so that the
time at which the expected reward switches for one arm is uncorrelated with
the times at which the other arms switch.
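
To make the difference between the two switching models concrete, the
following sketch advances the arm means by one time step under a constant
hazard rate. The function name `step_environment` and its arguments are
hypothetical and chosen only for illustration.

```python
import random


def step_environment(theta, hazard, rng, global_switching):
    """Advance the arm means one time step under a constant hazard rate.

    With probability `hazard` a change point occurs and the affected means
    are redrawn from U(0, 1); otherwise they are carried over unchanged.
    """
    if global_switching:
        # One change-point process shared by all arms.
        if rng.random() < hazard:
            return [rng.random() for _ in theta]
        return list(theta)
    # Per-arm: each arm has its own independent change-point process.
    return [rng.random() if rng.random() < hazard else th for th in theta]
```

Under Global Switching either every mean is redrawn or none is, whereas under
Per-Arm Switching any subset of the arms may change at a given step.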

Examples of this form of changing environment might include stock market
data, where stock prices can change their statistical nature very quickly
subject to external events. Switching behaviour has been studied in financial
markets before [Sorensen, 2007; Preis et al., 2011], and multi-armed bandits
have been applied to this field.

1.3 Regret for Switching Environments



In a stationary multi-armed bandit the regret measure we use is taken to be
the expected difference in reward between our strategy and that of a policy
which always chooses the arm with the highest expected payoff, $\mu^*$. In a
switching system the arm which is considered optimal changes. In this case a
different form of regret is often used. Garivier and Moulines [2008] use the
following definition of regret:

$$R(T) = \Big\langle \sum_{t=0}^{T} \big( \mu_t^* - \mu_{i_t} \big) \Big\rangle, \qquad (4)$$

where $\mu_t^*$ is the highest expected payoff of an arm at time $t$,
$\mu_{i_t}$ is the expected payoff of the arm pulled at time $t$, and
$\langle \cdot \rangle$ denotes expectation over possible sequences of
$\{\mu_{i_t}\}$.

Unless otherwise stated our experiments will report a related quantity,

$$R_n(T) = \Big\langle \sum_{t=0}^{T} I\big( \mu_t^* \neq \mu_{i_t} \big) \Big\rangle, \qquad (5)$$

which is the expected number of times a suboptimal arm is pulled. This
corresponds to the results reported by Hartland et al. [2007].
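
For a single simulated run in which the true expected payoffs are known, both
quantities can be accumulated directly. The sketch below assumes the payoffs
are available as a list of per-step lists; `regret_measures` is an
illustrative name, and averaging its output over repeated runs gives an
estimate of the expectations in (4) and (5).

```python
def regret_measures(mu, pulls):
    """mu[t][j]: expected payoff of arm j at time t; pulls[t]: arm pulled at t.

    Returns (regret as in (4), number of suboptimal pulls as in (5)) for one
    realised sequence of pulls.
    """
    regret, mistakes = 0.0, 0
    for mu_t, arm in zip(mu, pulls):
        best = max(mu_t)
        regret += best - mu_t[arm]
        mistakes += 1 if mu_t[arm] != best else 0
    return regret, mistakes
```

For example, with payoffs `[[0.2, 0.8], [0.2, 0.8], [0.9, 0.1]]` and pulls
`[1, 0, 1]`, the second and third pulls are suboptimal, so the mistake count
is 2 and the regret is 0.6 + 0.8 = 1.4.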



2 Switching Thompson Sampling 

In order to perform Thompson Sampling we wish to sample from
$P(\theta \mid D_{t-1})$, which is the probability of the arm model given the
data so far. In a switching system the arm model depends only on the data
since the last switch occurred, but we do not know when this happened. If we
did, we could just perform the same Bayesian update as in the standard
Bernoulli case to arrive at the distribution of our model. Since we do not
know the runlength $r_t$ (the number of time steps since the last switch), we
introduce it as a latent variable and marginalise it out. Taking $D_{t-1}$ as
the history of rewards and arm pulls seen so far, we can write this as

$$P(\theta \mid D_{t-1}) = \sum_{r_t} \underbrace{P(\theta \mid D_{t-1}, r_t)}_{\text{posterior of model given data}} \; \underbrace{P(r_t \mid D_{t-1})}_{\text{probability of runlength}}. \qquad (6)$$

Now to sample from $P(\theta \mid D_{t-1})$ we just need to sample from
$P(r_t \mid D_{t-1})$ (the runlength distribution) and then, given that
runlength, sample from $P(\theta \mid D_{t-1}, r_t)$ to arrive at our arm
model $\theta$. We can then select the arm to pull that maximises the expected
reward given this model.


3 Bayesian Online Change Detection



Fearnhead and Liu [2007], as well as Adams and MacKay [2007], have
independently done work on calculating the online posterior of the runlength.
They show that exact inference on the runlength can be achieved by a simple
message passing algorithm. Let $x_t$ be the reward at time $t$, so that
$D_t = x_t \cup D_{t-1}$. The inference procedure can be derived as follows:



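$$P(r_t \mid x_{t-1}, D_{t-2}) = \frac{P(r_t, x_{t-1}, D_{t-2})}{P(x_{t-1}, D_{t-2})}. \qquad (7)$$

The numerator can then be expressed as

$$P(r_t, x_{t-1}, D_{t-2}) = \sum_{r_{t-1}} P(r_t, r_{t-1}, x_{t-1}, D_{t-2}) \qquad (8)$$

$$= \sum_{r_{t-1}} P(r_t, x_{t-1} \mid r_{t-1}, D_{t-2})\, P(r_{t-1}, D_{t-2}) \qquad (9)$$

$$= \sum_{r_{t-1}} \underbrace{P(r_t \mid r_{t-1})}_{\text{switching rate}} \; \underbrace{P(x_{t-1} \mid r_{t-1}, D_{t-2})}_{\text{reward likelihood}} \; P(r_{t-1}, D_{t-2}). \qquad (10)$$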



The derivation just applies the rules of probability up to and including
equation (9). One assumption is made in equation (10): that the runlength
depends only on the runlength at the previous time step. This forms a simple
message passing algorithm because $r_t$ can only take values depending on
$r_{t-1}$. In fact $r_t = r_{t-1} + 1$ when switching does not occur and
$r_t = 0$ when it does. $P(r_t \mid r_{t-1})$ is defined by a hazard function
$h(t)$. For simplicity, in our case this is a constant switching rate $\gamma$.
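
As an illustration of the message passing in equation (10), the sketch below
performs one exact update of the runlength posterior for a single Bernoulli
arm with a constant switching rate. It is a sketch under stated assumptions:
the name `update_runlength` is hypothetical, the lists are indexed by
runlength, and the reward likelihood is taken to be the Beta-Bernoulli
posterior predictive.

```python
def update_runlength(weights, alpha, beta, reward, gamma,
                     alpha0=1.0, beta0=1.0):
    """One exact message-passing step for a single Bernoulli arm.

    weights[r] is the posterior probability of runlength r; alpha[r] and
    beta[r] are the Beta hyperparameters given runlength r. Returns the
    three lists for the next time step, each one entry longer.
    """
    # Posterior predictive probability of the observed reward per runlength.
    lik = [(a if reward == 1 else b) / (a + b) for a, b in zip(alpha, beta)]
    # Growth: no switch, runlength r -> r + 1.
    grown = [(1.0 - gamma) * l * w for l, w in zip(lik, weights)]
    # Change point: runlength resets to 0, collecting mass from every r.
    reset = gamma * sum(l * w for l, w in zip(lik, weights))
    total = reset + sum(grown)
    new_weights = [reset / total] + [g / total for g in grown]
    # Hyperparameters shift with the runlength; runlength 0 gets the prior.
    new_alpha = [alpha0] + [a + (reward == 1) for a in alpha]
    new_beta = [beta0] + [b + (reward == 0) for b in beta]
    return new_weights, new_alpha, new_beta
```

Each call lengthens the support by one entry, which is the linear growth in
space and time that motivates the resampling step.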

Unfortunately the exact inference has space and time requirements that grow
linearly in time. The space requirements are linear because at each time step
the support set of the posterior runlength distribution increases by one, which
means we have to store information for an extra value of the runlength at every
step. The update is also linear in time, as the message passing algorithm
requires an update to each runlength in the support. Adams and MacKay suggest
a simple thresholding technique to eliminate runlengths with small probability
mass associated with them. As we can only know in expectation how much memory
this algorithm will require, an alternative with hard guarantees on memory
requirements is desirable. Fearnhead and Liu suggest a more sophisticated
particle filter resampling step to maintain a finite sample of the runlength
distribution, which has the benefit that we can be certain of the upper limit
on the space the algorithm requires. This is the approach taken in this paper.

3.1 Particle Filters



A particle filter is a Monte Carlo method for approximately estimating a
sequential Bayesian model. Particles are used to represent points in the
distribution to be estimated and are assigned weights that correspond to their
approximate probabilities. The number of particles can grow at each time step,
so occasionally some particles need to be thrown away and new weights assigned
to the remaining particles. This procedure is called resampling.



3.1.1 Stratified Optimal Resampling 



Fearnhead and Clifford [2003] originally proposed optimal resampling. We wish
to reduce a discrete probability distribution with a support of $N$ discrete
points down to a stochastic distribution of $M$ discrete points, where the set
of $M$ points is a subset of the original $N$. The original $N$ points each
have probability mass $p_i$ associated with them, and the procedure finds a
reweighting of these probabilities, $q_i$, such that $N - M$ of the
probabilities are 0. The idea is that we wish there to be no bias in the
sampling procedure, which means that the expected value of $q_i$ should be the
original probability mass $p_i$. The algorithm is optimal in the sense that the
expected squared difference between the original probabilities $p_i$ and the
new weights $q_i$ is minimised.

This can be done by the following procedure:

1. Find $\kappa$ such that $M = \sum_{i=1}^{N} \min(1, p_i/\kappa)$

2. Sample $u$ from the uniform distribution $U(0, \kappa)$

3. Iterate through all $p_i$

   (a) If $p_i > \kappa$ then $q_i = p_i$

   (b) Otherwise

       i. $u = u - p_i$

       ii. If $u < 0$ then $q_i = \kappa$ and $u = u + \kappa$

       iii. Otherwise $q_i = 0$

The particles where $p_i = q_i$ are kept with probability 1. The remaining
particles are such that $q_i = \kappa$ with probability $p_i/\kappa$ and
$q_i = 0$ otherwise. Thus their expectation remains the same.

The worst-case time complexity of this algorithm is $O(N \log N)$, but it has
an amortised cost of $O(N)$ [Fearnhead and Clifford, 2003].
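
The three steps above can be sketched in Python as follows. This is an
illustrative sketch only: the helper `find_kappa` solves step 1 by sorting,
which is one simple way to do it and not necessarily the method used by
Fearnhead and Clifford, and both function names are hypothetical.

```python
import random


def find_kappa(p, m):
    """Solve sum_i min(1, p_i / kappa) = m for kappa (assumes len(p) > m)."""
    ordered = sorted(p, reverse=True)
    tail = sum(ordered)
    for kept, largest in enumerate(ordered[:m]):
        kappa = tail / (m - kept)
        if largest <= kappa:
            return kappa
        tail -= largest          # this particle is kept outright
    raise ValueError("need more particles than the target size m")


def stratified_optimal_resample(p, m, rng=random):
    """Return weights q with m non-zero entries and E[q_i] = p_i."""
    kappa = find_kappa(p, m)
    u = rng.uniform(0.0, kappa)
    q = []
    for p_i in p:
        if p_i > kappa:
            q.append(p_i)        # kept with probability 1, weight unchanged
            continue
        u -= p_i
        if u < 0.0:
            q.append(kappa)      # survives with probability p_i / kappa
            u += kappa
        else:
            q.append(0.0)        # discarded
    return q
```

With `p = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]` and `m = 3`, the threshold is
$\kappa = 0.25$: the first particle is kept with its weight unchanged and two
of the remaining five survive with weight 0.25 each, so the weights still sum
to 1.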



4 Proposed Inference Models 

We have shown we can perform Thompson Sampling in a switching system by
splitting the procedure into a stage that samples the runlength since a switch
occurred and a stage that samples from the arm model given this runlength.
The Global Switching and Per-Arm Switching models are appealing due to
their simplicity. Only one runlength distribution needs to be inferred for
Global Switching, and it does not depend on the number of arms the bandit has.
The Per-Arm model stores the runlengths for each arm independently, so the
space requirements grow linearly with the number of arms. Other models with
more complicated dependencies between arms can quickly become intractable.

4.1 Global switching 

In global switching there is a single change point process across all of the
arms, since when one arm switches distribution so do all the other arms. This
means that the data from every arm pull contributes to the posterior of the
single runlength distribution. To sample from the posterior of the full bandit
model, we first sample from the runlength distribution. This gives us an
estimate of the runlength, which tells us how much data from the past our arms
can use. Once the global runlength is sampled, we sample individually from the
posterior distributions of the arms, given only the data since the last
change point (determined by the runlength). The arm with the maximum sample is
then pulled. We only need to store the posterior probabilities of the
runlengths and the hyperparameters for the arm posteriors associated with
those runlengths. We will call the runlength distribution the Change Point
model, and the set of hyperparameters associated with each runlength for a
given action the Arm model.

The Change Point model is an approximation of the runlength distribution
storing a probability for at most $N$ runlengths. Let $w_i^t$ be the
probability of having a runlength of $i$ at time $t$. In this paper the arm
rewards are assumed to come from a Bernoulli distribution, so the
hyperparameters stored are the 2 parameters of the Beta distribution. Let
$\alpha_{i,j}^t$ and $\beta_{i,j}^t$ be the hyperparameters for a runlength of
$i$ at time $t$ for arm $j$. At any point in time $t$ there is a set of stored
runlengths $R_t$ with $|R_t| \le N$, where for every $r \in R_t$ there exist
quantities $\alpha_{r,j}^t$ and $\beta_{r,j}^t$. When $|R_t| = N$ a resampling
step is performed in order to reduce the number of runlengths stored. For ease
of notation let $\{w\}^t$ be the set of runlength probabilities at time $t$
and let $\{\alpha\}_j^t$ and $\{\beta\}_j^t$ be the sets of hyperparameters
for arm $j$ at time $t$. Similarly let $\{\alpha\}^t$ and $\{\beta\}^t$ be the
sets of all hyperparameters at time $t$. The algorithm is presented in
pseudocode in Algorithm 1.

We will refer to this algorithm as Global Change-Point Thompson Sampling
(Global-CTS).
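
A compact Python sketch of the Global-CTS loop of Algorithm 1 is given here
for orientation. It is not the authors' implementation: the class name
`GlobalCTS` and its attributes are hypothetical, and, to keep the example
short, the stratified optimal resampling step is replaced by a plainly simpler
stand-in that drops the lightest particle other than the newly created one.

```python
import random


class GlobalCTS:
    """Sketch of Global-CTS for Bernoulli arms (pruning is a stand-in)."""

    def __init__(self, n_arms, gamma, max_particles,
                 alpha0=1.0, beta0=1.0, seed=None):
        self.k, self.gamma, self.n = n_arms, gamma, max_particles
        self.alpha0, self.beta0 = alpha0, beta0
        self.rng = random.Random(seed)
        self.w = [1.0]                         # runlength weights
        self.alpha = [[alpha0] * n_arms]       # alpha[particle][arm]
        self.beta = [[beta0] * n_arms]

    def select_action(self):
        # Sample a runlength particle, then one Beta draw per arm.
        i = self.rng.choices(range(len(self.w)), weights=self.w)[0]
        draws = [self.rng.betavariate(a, b)
                 for a, b in zip(self.alpha[i], self.beta[i])]
        return max(range(self.k), key=draws.__getitem__)

    def update(self, arm, reward):
        # Change Point model: predictive likelihood of the reward per particle.
        lik = [(a[arm] if reward else b[arm]) / (a[arm] + b[arm])
               for a, b in zip(self.alpha, self.beta)]
        grown = [(1 - self.gamma) * l * w for l, w in zip(lik, self.w)]
        reset = self.gamma * sum(l * w for l, w in zip(lik, self.w))
        total = reset + sum(grown)
        self.w = [reset / total] + [g / total for g in grown]
        # Arm model: conjugate update of the pulled arm on every particle.
        for a, b in zip(self.alpha, self.beta):
            if reward:
                a[arm] += 1
            else:
                b[arm] += 1
        # The new runlength-0 particle (index 0) starts from the prior.
        self.alpha.insert(0, [self.alpha0] * self.k)
        self.beta.insert(0, [self.beta0] * self.k)
        if len(self.w) > self.n:
            self._prune()

    def _prune(self):
        # Stand-in for stratified optimal resampling: drop the lightest
        # particle, never the newest one, then renormalise.
        j = min(range(1, len(self.w)), key=self.w.__getitem__)
        for seq in (self.w, self.alpha, self.beta):
            del seq[j]
        total = sum(self.w)
        self.w = [w / total for w in self.w]
```

Because the switching rate is constant, the sketch never needs the numerical
value of a runlength, only the weight and hyperparameters of each particle,
so list positions need not equal runlengths after pruning. Unlike stratified
optimal resampling, the pruning rule used here is biased, so it should be
read as a placeholder for the resampling step rather than a substitute.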

Algorithm 1 Global Change-Point Thompson Sampling

procedure Global-CTS($N$, $\gamma$, $\alpha_0 = 1$, $\beta_0 = 1$)
    $t \leftarrow 0$    ▷ Initialise time
    $w_0^0 \leftarrow 1$, and add to $\{w\}^0$    ▷ Initialise runlength distribution
    For all arms $j$, $\alpha_{0,j}^0 \leftarrow \alpha_0$    ▷ Initialise hyperparameters
    For all arms $j$, $\beta_{0,j}^0 \leftarrow \beta_0$
    while Interacting do
        $a \leftarrow$ SelectAction($\{w\}^t$, $\{\alpha\}^t$, $\{\beta\}^t$)
        $r \leftarrow$ PullArm($a$)
        $\{w\}^{t+1} \leftarrow$ UpdateChangeModel($\{w\}^t$, $\{\alpha\}^t$, $\{\beta\}^t$, $a$, $r$, $\gamma$)
        $\{\alpha\}^{t+1}, \{\beta\}^{t+1} \leftarrow$ UpdateArmModels($\{\alpha\}^t$, $\{\beta\}^t$, $a$, $r$)
        if $|\{w\}^{t+1}| = N$ then
            ParticleResample($\{w\}^{t+1}$, $\{\alpha\}^{t+1}$, $\{\beta\}^{t+1}$)
        end if
        $t \leftarrow t + 1$
    end while
end procedure

procedure UpdateChangeModel($\{w\}^t$, $\{\alpha\}^t$, $\{\beta\}^t$, $a$, $r$, $\gamma$)
    if $r = 1$ then
        likelihood$_i \leftarrow \alpha_{i,a}^t / (\alpha_{i,a}^t + \beta_{i,a}^t)$, for all $i$ s.t. $w_i^t \in \{w\}^t$
    else
        likelihood$_i \leftarrow \beta_{i,a}^t / (\alpha_{i,a}^t + \beta_{i,a}^t)$, for all $i$ s.t. $w_i^t \in \{w\}^t$
    end if
    $w_{i+1}^{t+1} \leftarrow (1 - \gamma) \cdot$ likelihood$_i \cdot w_i^t$, for all $i$ s.t. $w_i^t \in \{w\}^t$
    $w_0^{t+1} \leftarrow \gamma \cdot \sum_i$ likelihood$_i \cdot w_i^t$
    Normalise $\{w\}^{t+1}$
    return $\{w\}^{t+1}$
end procedure

procedure UpdateArmModels($\{\alpha\}^t$, $\{\beta\}^t$, $a$, $r$)
    if $r = 1$ then
        $\alpha_{i+1,a}^{t+1} \leftarrow \alpha_{i,a}^t + 1$, for all $i$ s.t. $\alpha_{i,a}^t \in \{\alpha\}^t$
    else
        $\beta_{i+1,a}^{t+1} \leftarrow \beta_{i,a}^t + 1$, for all $i$ s.t. $\beta_{i,a}^t \in \{\beta\}^t$
    end if
    $\alpha_{0,j}^{t+1} \leftarrow \alpha_0$, for all arms $j$    ▷ Set prior for runlength 0
    $\beta_{0,j}^{t+1} \leftarrow \beta_0$, for all arms $j$
    return $\{\alpha\}^{t+1}$, $\{\beta\}^{t+1}$
end procedure

procedure ParticleResample($\{w\}^{t+1}$, $\{\alpha\}^{t+1}$, $\{\beta\}^{t+1}$)
    Find the set to discard, $d \in D$, using Stratified Optimal Resampling on $\{w\}^{t+1}$
    Discard all $w_d^{t+1}$, $\alpha_{d,j}^{t+1}$ and $\beta_{d,j}^{t+1}$
end procedure

procedure SelectAction($\{w\}^t$, $\{\alpha\}^t$, $\{\beta\}^t$)
    Pick $i$ with probability $w_i^t$
    for each arm $j$ do
        sample$_j \leftarrow$ Beta($\alpha_{i,j}^t$, $\beta_{i,j}^t$)
    end for
    return $\arg\max_j$ sample$_j$
end procedure


4.2 Per-arm switching

The difference in implementation with respect to global switching is that now
there is a runlength distribution for each arm. That is, for each arm $j$ we
have a different set of runlength probabilities $w_{i,j}^t \in \{w\}_j^t$. In
the per-arm switching model, at a time step $t$ we update the Change Point
model associated with the arm that was pulled at $t$ via the update equations
sketched in equation (10). The Change Point models associated with arms not
pulled at $t$ are updated differently, since the runlength for these arms is
independent of the reward we received for the arm we actually pulled. The
reward likelihood term disappears in the update equations for the runlength
distribution of unpulled arms, as shown in equation (13). Since we normalise
the distribution at each step we can ignore the factor $P(x_{t-1})$. This is
shown as follows:

$$P(r_t, x_{t-1}, D_{t-2}) = P(x_{t-1})\, P(r_t, D_{t-2}) \qquad (11)$$

$$\propto \sum_{r_{t-1}} P(r_t, r_{t-1}, D_{t-2}) \qquad (12)$$

$$\propto \sum_{r_{t-1}} P(r_t \mid r_{t-1}, D_{t-2})\, P(r_{t-1}, D_{t-2}). \qquad (13)$$

We will refer to this algorithm as Per-Arm Change-Point Thompson Sampling
(PA-CTS).
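
For an arm that was not pulled, equation (13) reduces to applying the
switching prior alone. A minimal sketch, assuming a constant switching rate
and a list of weights indexed by runlength (the function name is
hypothetical):

```python
def update_unpulled_runlength(weights, gamma):
    """Runlength update for an arm that was not pulled (equation (13)).

    No reward was observed for this arm, so only the switching prior is
    applied; the reward likelihood term is absent.
    """
    grown = [(1.0 - gamma) * w for w in weights]   # runlength r -> r + 1
    reset = gamma * sum(weights)                   # runlength -> 0
    total = reset + sum(grown)
    return [reset / total] + [g / total for g in grown]
```

For instance, weights `[0.25, 0.75]` with a switching rate of 0.2 become
approximately `[0.2, 0.2, 0.6]`: the new runlength 0 receives mass equal to
the switching rate, and the remaining mass is shifted along by one step.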



5 Learning the Switching Rate 



Both Wilson et al. [2010] and Turner et al. [2009] have proposed methods for
learning the hazard function from the data. The method of Wilson et al. can
learn a hazard function that is piecewise constant via a hierarchical
generative model. The method of Turner et al. can learn any parametric hazard
rate via gradient descent, but in initial investigations it did not appear to
perform particularly well when the hazard rate is adapted at every time step.
For the purposes of this paper a constant switching rate was assumed, which
was learned using the approach of Wilson et al.

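For the simplest case, where we consider a single constant switching rate,
Wilson et al. model whether a change point occurred as a Bernoulli variable.
The hyperparameters of this switching rate are those of a Beta distribution
and can be thought of as the number of times the system has switched, $a_t$,
and the number of times it has not switched, $b_t$. We now compute the joint
distribution $P(r_t, a_t \mid x_{t-1}, D_{t-2})$ as opposed to the original
distribution $P(r_t \mid x_{t-1}, D_{t-2})$. The message passing proceeds in a
very similar fashion as before, except that the number of particles now grows
quadratically rather than linearly.

In the global switching model the algorithm now keeps track of sets of
particles $w_{r,a}^t$, $\alpha_{r,i,a}^t$ and $\beta_{r,i,a}^t$ associated with
a runlength $r$, switching rate hyperparameter $a$, arm $i$ and time $t$. The
updates, for a reward received from arm $i$, are as follows:

$$w_{0,0}^{0} \leftarrow 1$$

$$w_{r+1,a}^{t+1} \leftarrow \frac{t - a + 1}{t + 2} \, \frac{\alpha_{r,i,a}^t}{\alpha_{r,i,a}^t + \beta_{r,i,a}^t} \, w_{r,a}^t \quad \text{if reward} = 1$$

$$w_{r+1,a}^{t+1} \leftarrow \frac{t - a + 1}{t + 2} \, \frac{\beta_{r,i,a}^t}{\alpha_{r,i,a}^t + \beta_{r,i,a}^t} \, w_{r,a}^t \quad \text{if reward} = 0$$

$$w_{0,a+1}^{t+1} \leftarrow \sum_{r} \frac{a + 1}{t + 2} \, \frac{\alpha_{r,i,a}^t}{\alpha_{r,i,a}^t + \beta_{r,i,a}^t} \, w_{r,a}^t \quad \text{if reward} = 1$$

$$w_{0,a+1}^{t+1} \leftarrow \sum_{r} \frac{a + 1}{t + 2} \, \frac{\beta_{r,i,a}^t}{\alpha_{r,i,a}^t + \beta_{r,i,a}^t} \, w_{r,a}^t \quad \text{if reward} = 0$$

We again use the resampling algorithm of Fearnhead and Clifford [2003] to
manage the space requirements of the algorithm.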

In the Global Switching model there is only 1 runlength distribution, and so
only 1 switching rate to learn. This leads naturally to an algorithm we call
Non-Parametric Global Change-Point Thompson Sampling (NP Global-CTS). With
Per-Arm Switching there are several possibilities: there could be a single
switching rate shared by all of the independent arms, or each arm could have
a separate switching rate. In this paper we assume each arm has a separate
switching rate and call this algorithm Non-Parametric Per-Arm Change-Point
Thompson Sampling (NP PA-CTS).
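
The joint update over runlength and switch count can be sketched as follows.
This is a sketch under explicit assumptions rather than the exact update of
Wilson et al.: the hazard is given a Beta(1, 1) prior, so its posterior mean
after $t$ steps with $a$ switches is $(a + 1)/(t + 2)$; only the pulled arm's
hyperparameters are carried; and the name `update_joint` and the dictionary
layout are hypothetical.

```python
def update_joint(particles, t, reward, alpha0=1.0, beta0=1.0):
    """particles maps (r, a) -> (weight, alpha, beta) for the pulled arm,
    where r is the runlength and a the number of switches so far.
    Returns the particle set for time t + 1.
    """
    new = {}
    for (r, a), (w, al, be) in particles.items():
        lik = (al if reward == 1 else be) / (al + be)
        p_switch = (a + 1.0) / (t + 2.0)    # posterior mean of the hazard
        # No switch: runlength grows, switch count unchanged.
        new[(r + 1, a)] = ((1.0 - p_switch) * lik * w,
                           al + (reward == 1), be + (reward == 0))
        # Switch: runlength resets and the switch count increments; the
        # mass is accumulated over all runlengths r with the same a.
        prev = new.get((0, a + 1), (0.0, alpha0, beta0))[0]
        new[(0, a + 1)] = (prev + p_switch * lik * w, alpha0, beta0)
    total = sum(w for w, _, _ in new.values())
    return {key: (w / total, al, be) for key, (w, al, be) in new.items()}
```

Starting from the single particle `{(0, 0): (1.0, 1.0, 1.0)}`, one update
yields two particles and a second yields four, which illustrates why a
resampling step is needed to bound the number of particles.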



6 Tracking Changes In The Best Arm 

The algorithms presented so far attempt to track changes in all arms,
irrespective of whether they are pulled. The distribution of $\theta_i$ for an
arm that is not pulled will slowly become flatter and thus have higher
variance. One of the stated assumptions of Adapt-EvE was that it is only
important to track whether the perceived best arm has changed distribution. We
can modify the algorithms to better replicate this assumption.

The model for Per-Arm Switching is adapted most simply. Since both the
Change Point models and the Arm models are independent for each arm, we can
track changes in the best arm by only updating the Change Point and Arm models
for the arm that was pulled. For the Global Switching model the change is not
so clear, since all arms share a Change Point model. In this paper we updated
the shared Change Point model and only updated the Arm model for the arm
pulled. The question then arises of what the hyperparameters associated with
the new runlength of zero should be for the unpulled arms. The approach taken
here was to set them to the hyperparameters associated with a random other
runlength in the distribution.

We can apply the same method from Wilson et al. to infer the switching rate
for these architectures as well. The algorithms described in this section will
be denoted by having a "2" appended to the algorithm name.



7 Experiments 

Six different non-stationary environments were used to evaluate our bandit
model. Four are based on purely synthetic data, and two use data collected
from the real world. The parameters for all experiments shown were tuned based
on the PASCAL challenge. Bold denotes best results in all tables.



7.1 Global Switching Environment 

We first compare the algorithms in an environment with a constant global
switching rate. Global-CTS and NP Global-CTS were designed for this
environment, so a priori we would expect them to perform the best.

The first set of experiments consisted of a single run of each algorithm in
an instance of this environment type with 2 arms. Figures 1 and 2 plot example
heatmaps of the runlength distributions of some of the algorithms. At each
time step the graphs show the runlength distribution. In the case of PA-CTS
and NP PA-CTS there are 2 plots for each algorithm, corresponding to the
runlength distribution for each arm. The payoffs of the 2 arms have been
superimposed over the plots so that it can be seen how the runlength
distribution matches up with the changes in the environment. From the heatmap
figures we can see that the change point prediction works when applied to a
bandit problem. As expected, the change point distribution looks to be more
accurate for the Global-CTS and NP Global-CTS algorithms, which use the Global
Switching model. This is because each data point can contribute to the
posterior runlength distribution. PA-CTS also performs reasonably well, even
though the amount of data that has influence on each posterior is reduced. For
the NP PA-CTS algorithm, learning the separate switching rates appears to
significantly decrease the certainty for a particular runlength.

An experiment comparing the algorithms in this setting was performed. 
Each run was over a period of 10^ time steps and the experiment was repeated 
100 times. The results are displayed in Table 1. All parameters were set as for
the PASCAL challenge test run. The environment's constant switching rate was
10-1 

Global-CTS performs the best in this environment, which is not surprising since
the environment fits the algorithm's model. NP Global-CTS also performs well,
which suggests that learning the hazard rate for this model may be feasible.



Figure 1: Runlength Distribution for Global-CTS and NP Global-CTS in Global
Switching Environment. The mean payoffs of the arms are superimposed over
the distribution.



7.2 Per-Arm Switching Environment

The next environment was a switching system where the switching for each arm
was independent of every other arm. PA-CTS and NP PA-CTS were designed with
this situation in mind, and again a priori may be expected to perform better.



An experiment comparing the algorithms was performed with 10^ iterations 
and then repeated 100 times. The results are shown in Table 2. As expected, the
PA-CTS algorithm performs best in this environment. NP PA-CTS, the algorithm
corresponding to PA-CTS that learns the hazard rate, suffers much more regret,
which would appear to indicate that for this particular model the parameters
are not being learned quickly enough. The algorithms designed for a Global
Switching environment also perform reasonably in this sort of environment.

Figure 2: Runlength Distribution for PA-CTS and NP PA-CTS in Global
Switching Environment. The mean payoffs of the arms are superimposed over
the distribution.



7.3 Bernoulli Armed Bandit with Random Normal Walk 

The PASCAL challenge environments were found to be periodic, which is not
the sort of environment our algorithms were intended for. Another simulated
environment was therefore investigated. In this environment, at time $t$ each
arm $i$ was Bernoulli with probability of success $\theta_i(t)$. At each time
step the success rate of the arm was allowed to drift as a truncated normal
walk. That is, the probability of success for an arm, $\theta_i(t) \in [0, 1]$,
conditional on $\theta_i(t-1) \in [0, 1]$ is

$$P(\theta_i(t) \mid \theta_i(t-1)) = \frac{\exp\!\left(-\frac{(\theta_i(t-1) - \theta_i(t))^2}{2\sigma^2}\right)}{\int_0^1 \exp\!\left(-\frac{(\theta_i(t-1) - x)^2}{2\sigma^2}\right) dx}. \qquad (14)$$

Table 3 shows a comparison of the algorithms. The experiment was run
100 times, where each run had a period of 10^. The variance of the random 
walk was set to $\sigma^2 = 0.03$.
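
One simple way to simulate such a walk is rejection sampling: propose a
normal step and redraw until the proposal lies in $[0, 1]$, which samples from
the normal density renormalised on that interval. The sketch below is
illustrative only; the function names are hypothetical and the paper does not
state how the walk was generated.

```python
import random


def truncated_normal_step(theta, sigma, rng=random):
    """Draw theta(t) from N(theta(t-1), sigma^2) truncated to [0, 1]."""
    while True:
        proposal = rng.gauss(theta, sigma)
        if 0.0 <= proposal <= 1.0:
            return proposal      # rejection sampling: keep in-range draws


def simulate_walk(theta0, sigma, steps, seed=0):
    """Return the path theta(0), ..., theta(steps) of one arm."""
    rng = random.Random(seed)
    path = [theta0]
    for _ in range(steps):
        path.append(truncated_normal_step(path[-1], sigma, rng))
    return path
```

A variance of 0.03 corresponds to `sigma = 0.03 ** 0.5`, for example
`simulate_walk(0.5, 0.03 ** 0.5, 1000)`.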

In this sort of environment it appears that our algorithms perform better
than the benchmark algorithms, with NP PA-CTS2 achieving the smallest regret.

Table 1: Results against Global Switching Environment (given as number of
mistakes xlO~^± Std. Error) 

Name              Regret          | Name             Regret

Global-CTS          5.9 ± 0.07    | Global-CTS2       30.5 ± 1.07
PA-CTS             12.1 ± 0.10    | PA-CTS2           49.6 ± 1.70
NP Global-CTS       6.7 ± 0.08    | NP PA-CTS         29.4 ± 0.95
NP Global-CTS2     10.3 ± 0.20    | NP PA-CTS2        25.6 ± 0.86
UCB               178.3 ± 8.20    | DiscountedUCB     15.5 ± 0.27
Random            333.1 ± 2.09    |



Table 2: Results against Per-Arm Switching Environment (given as number of
mistakes xlO~^± Std. Error) 



Name              Regret          | Name             Regret

Global-CTS         13.8 ± 0.20    | Global-CTS2       37.9 ± 1.02
PA-CTS             13.0 ± 0.11    | PA-CTS2           67.1 ± 1.23
NP Global-CTS      13.8 ± 0.17    | NP PA-CTS         30.8 ± 0.79
NP Global-CTS2     15.8 ± 0.28    | NP PA-CTS2        38.1 ± 0.83
UCB               175.1 ± 7.47    | DiscountedUCB     16.8 ± 0.28
Random            336.4 ± 1.85    |



7.4 PASCAL Challenge 2006 

The PASCAL Exploration vs. Exploitation Challenge 2006 was a competition in
a multi-armed bandit problem [Hussain et al., 2006]. The challenge revolved
around website content optimisation, whereby the options available
corresponded to different content to present to a user on a website. The
challenge is a good general test for the algorithms presented in this paper,
since to perform well the bandit algorithms were required to work in
non-stationary environments. The challenge had 6 separate environments in
which the algorithms needed to perform: Frequent Swap (FS), Long Gaussians
(LG), Weekly Variation (WV), Daily Variation (DV), Weekly Close Variation
(WCV) and Constant (C). These environments are artificially generated, and
the dynamics of the expected payoffs resemble either periodic Gaussian,
sinusoidal or constant signals.

Hartland et al. [2007] won this competition with the Adapt-EvE algorithm.
The Adapt-EvE algorithm's most prominent feature is its use of a change-point
detection mechanism. Since the algorithms presented in this paper also use a
change-point mechanism it is interesting to compare their performance. The
challenge also provides environments for which our algorithms were not
directly designed, and so will hopefully indicate some robustness in their
strategy. We were unable to implement a version of Adapt-EvE that replicated
the reported performance, so here we simply reproduce the published results.
To avoid an unfair comparison we did not run our own implementation of
Adapt-EvE in the other environments.

Table 4 shows a comparison of the Change-Point Thompson Sampling algorithms
(Global-CTS, PA-CTS, NP Global-CTS, NP PA-CTS) against Adapt-EvE Meta-Bandit
and Meta-p-Bandit [Hartland et al., 2007]. The comparison also features the
algorithm "DiscountedUCB", which was submitted by Thomas Jaksch to the same
competition and performed comparably to Adapt-EvE. The code for this
algorithm was available and so has been included for comparison in all other
environments.

Table 3: Results against Bernoulli Bandit with Truncated Normal Walk (given
as number of mistakes x 10~^± Std. Error)

Name              Regret          | Name             Regret

Global-CTS         97.9 ± 0.10    | Global-CTS2      134.1 ± 0.23
PA-CTS            107.1 ± 0.16    | PA-CTS2          148.9 ± 0.35
NP Global-CTS     116.6 ± 0.13    | NP PA-CTS        117.0 ± 0.11
NP Global-CTS2    100.9 ± 0.10    | NP PA-CTS2        94.8 ± 0.13
UCB               194.5 ± 3.78    | DiscountedUCB    162.4 ± 0.47
Random            325.9 ± 0.24    |



Table 4: Results against PASCAL EvE Challenge 2006 (given as number of 
mistakes X 10~^) 





         Global-CTS       Global-CTS2      Adapt-EvE Meta-p
WCV       8.9 ± 0.4        6.9 ± 0.4        5.5 ± 0.9
FS       27.9 ± 2.4       12.5 ± 1.3       10.6 ± 1.3
C         0.6 ± 0.1        1.0 ± 0.2        3.2 ± 0.3
DV       17.1 ± 0.3        6.6 ± 0.3        6.1 ± 0.7
LG        4.4 ± 0.4        3.4 ± 0.5        4.3 ± 1.4
WV        8.2 ± 0.3        5.3 ± 0.5        5.1 ± 0.9
Total    67.2             35.8             34.7

         PA-CTS           PA-CTS2          Adapt-EvE Meta
WCV       4.2 ± 0.8        6.2 ± 0.4        5.4 ± 0.8
FS       13.7 ± 1.6       15.1 ± 1.7       14.0 ± 1.9
C         3.2 ± 0.4        2.0 ± 0.3        2.5 ± 0.5
DV        4.5 ± 1.5        4.9 ± 0.5        6.2 ± 0.7
LG        9.4 ± 2.9        3.7 ± 0.7        4.8 ± 1.6
WV        4.7 ± 1.7        5.4 ± 0.5        4.8 ± 0.8
Total    39.6             37.4             37.7

         NP Global-CTS    NP Global-CTS2   DiscountedUCB
WCV       8.9 ± 0.4        9.0 ± 0.3        5.3 ± 0.5
FS       28.2 ± 2.7       14.8 ± 1.2       10.1 ± 1.1
C         0.3 ± 0.2        0.8 ± 0.2        5.5 ± 0.5
DV       17.6 ± 0.3       16.0 ± 0.3        7.9 ± 0.9
LG        4.4 ± 0.4        4.0 ± 0.3        2.9 ± 0.4
WV        8.5 ± 0.3        8.4 ± 0.3        4.0 ± 0.4
Total    67.9             53.1             35.7

         NP PA-CTS        NP PA-CTS2       Random
WCV      12.8 ± 0.7       10.4 ± 0.4       25.7 ± 0.3
FS       23.1 ± 1.2       23.0 ± 1.9       49.1 ± 0.5
C        15.8 ± 0.4        1.9 ± 0.2       20.0 ± 0.1
DV       15.1 ± 1.0       24.2 ± 0.3       57.2 ± 0.3
LG       14.4 ± 2.1        8.2 ± 0.5      112.1 ± 9.1
WV       12.1 ± 1.1       11.7 ± 0.4       57.2 ± 0.3
Total    93.2             79.3            321.3



7.5 Yahoo! Front Page Click Log Dataset 



Yahoo! [2011] have produced a bandit algorithm dataset. The dataset provides
information about the top story presented to a user on the front page of
Yahoo!. Each entry in the dataset gives information about a single article
presented, the time it was presented, contextual information about the user
and whether the user "clicked through" to the article or not. The dataset was
designed for the contextual bandit problem. Given the context of a user, the
goal is to select an article to present to the user so as to maximise the
expected rate at which users click on the article to read more
(click-through). The articles also change during the dataset, so bandit
algorithms designed specifically for this environment also need the ability
to modify the number of arms they can select from.

For the purposes of our experiments we do not concern ourselves with the
contextual case, nor do we try to incorporate new articles as they arrive.
Instead we ignore the context and only pick from a fixed set of articles. This
reduces the problem to a conventional multi-armed bandit problem. To maximise
the amount of data used, for each run we randomly selected the set of articles
(in our case 5 articles) from a list of 100 permutations of possible articles
which overlapped in time the most. The click-through rates were estimated from
the data by taking the mean of an article's click-through rate every 1000 time
ticks.



The simulation then proceeded as described by Li et al. [2011]. The results
are presented in Table 5. The regret for each run was normalised by the number
of arm pulls, since this was different in each run of the simulation.
Parameters were set as for the PASCAL challenge dataset.
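
The varying number of arm pulls is what one would expect from a replay-style
offline evaluation, in which a logged event is used only when the logged
article matches the article the policy would have chosen. The sketch below
assumes that this is the procedure referred to and that events are logged as
(article, click) pairs; the function and callback names are hypothetical.

```python
def replay_evaluate(events, select, update):
    """events: iterable of (logged_arm, reward) pairs from logged data.

    select() returns the policy's arm; update(arm, reward) feeds back a
    reward. Only events whose logged arm matches the policy's choice are
    used. Returns (total_reward, number_of_matched_pulls).
    """
    total, pulls = 0, 0
    for logged_arm, reward in events:
        arm = select()
        if arm != logged_arm:
            continue             # event discarded; policy state unchanged
        update(arm, reward)
        total += reward
        pulls += 1
    return total, pulls
```

Because only matching events count as pulls, two runs over the same log can
end with different numbers of pulls, which is why a per-pull normalisation is
needed before runs are compared.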



Table 5: Results against Yahoo! Front Page Click Log Dataset (±Std. Error) 



Name              Regret           | Name             Regret

Global-CTS        0.489 ± 0.035    | Global-CTS2      0.443 ± 0.031
PA-CTS            0.522 ± 0.028    | PA-CTS2          0.505 ± 0.028
NP Global-CTS     0.490 ± 0.029    | NP PA-CTS        0.590 ± 0.018
NP Global-CTS2    0.530 ± 0.026    | NP PA-CTS2       0.563 ± 0.018
UCB               0.526 ± 0.040    | DiscountedUCB    0.568 ± 0.022
Random            0.800 ± 0.001    |







7.6 Foreign Exchange Rate Data 



We constructed a final test environment from foreign exchange rate data
[Dukascopy, 2012]. Ask prices for 4 currency exchange rates (GBP-USD, USD-JPY,
NZD-CHF, EUR-CAD) at a resolution of 2 minutes spanning 7 years were used.
This amounted to approximately $10^6$ datapoints per exchange rate pair. The
bandit problem using this data was set up as follows. Each exchange rate was
thought of as a 2-armed bandit. It was imagined that the agent could make
fictitious trades, and could decide either to buy a long call option (if they
believe the rate will increase) or a short call option (if they believe the
rate will go down). To turn this into a Bernoulli bandit problem, we ignore the
scale of the change and provide a reward of 1 if the bandit correctly
predicted the rate going up or down and 0 otherwise. When the rate remains the
same, the agent receives a reward of 0 irrespective of their decision. For the
purpose of the experiment we imagine the option length is 100 time ticks, so
that the agent has to decide if the exchange rate will increase or decrease in
100 time ticks. Although this bandit scenario is not true to life, we believe
that the underlying data should exhibit some of the characteristics of a
switching system for which the algorithms were designed [Preis et al., 2011].
We cannot estimate a "true" average payoff at each time step, and so cannot
measure the regret of these algorithms; instead we report the error. The
results are shown in Table 6.
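
The reward rule described above can be written down directly. In this sketch
the price series is assumed to be a list of ask prices indexed by time tick,
and the arm labels (0 for "rate rises", 1 for "rate falls") and the function
name are illustrative choices.

```python
def fx_reward(prices, t, arm, horizon=100):
    """Reward for betting at tick t on the rate `horizon` ticks ahead.

    arm 0 bets that the rate rises, arm 1 that it falls (labels are
    illustrative). An unchanged rate pays 0 whichever arm is chosen.
    """
    change = prices[t + horizon] - prices[t]
    if change == 0:
        return 0
    went_up = change > 0
    return 1 if (arm == 0) == went_up else 0
```

With `prices = [1.0, 1.2, 1.1, 1.1]` and `horizon=1`, betting on a rise at
tick 0 pays 1, betting on a fall at tick 1 pays 1, and either bet at tick 2
pays 0 because the rate does not move.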



Table 6: Results against Foreign Exchange Bandit Environment (number of 
mistakes xlO~^±Std. Error) 



Name              Error           | Name             Error

Global-CTS        351.9 ± 14.1    | Global-CTS2      358.0 ± 13.95
PA-CTS            370.4 ± 13.7    | PA-CTS2          380.9 ± 12.5
NP Global-CTS     348.2 ± 13.7    | NP PA-CTS        353.5 ± 13.8
NP Global-CTS2    353.2 ± 13.4    | NP PA-CTS2       352.0 ± 13.9
UCB               613.9 ± 17.7    | DiscountedUCB    606.3 ± 16.0
Random            623.3 ± 14.1    |



8 Conclusion 

This paper has explored several algorithms using Thompson Sampling in
conjunction with change point detection. We have shown that they perform well
in the environments for which they are designed. Bandit scenarios based on
real-world data, such as the Yahoo! dataset and the foreign exchange data,
also demonstrate their performance. They are shown not to perform as well as
appropriately tuned competing algorithms in the PASCAL challenge. However, our
results suggest that a strategy that just tracks changes in the perceived
best arm (Global-CTS2, PA-CTS2), similar to Adapt-EvE, works well.

Since the model is extremely modular, it is hoped that further assumptions can
be incorporated into the model to improve performance. It is also worth noting
that non-Bernoulli payoffs can just as easily be used, e.g. Normally
distributed payoffs. A Bayesian approach also avoids difficulties that arise
with handling false alarms in change point detection schemes. The algorithms
have been derived from simple models and so are theoretically motivated;
however, steps still need to be taken to provide any theoretical justification
for their performance.